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Abstract 

Current critical systems often use a lot of floating-point computations, 
£NJ , and thus the testing or static analysis of programs containing floating- 

O^l ' point operators has become a priority. However, correctly defining the 

semantics of common implementations of floating-point is tricky, because 
semantics may change according to many factors beyond source-code level, 
such as choices made by compilers. We here give concrete examples of 
problems that can appear and solutions for implementing in analysis soft- 
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in ■ 1 Introduction 

> . 

In the past, critical applications often used fixed-point computations. However, 
with the wide availability of processors with hardware floating-point units, many 
current critical applications (say, for controlling automotive or aerospace sys- 
tems) use floating-point operations. Such applications have to undergo stringent 
[ — . testing or validation. In this paper, we show how the particularities of floating- 

point implementations can hinder testing methodologies and have to be cared 
ryj , for in techniques based on program semantics, be them assisted proofs, or au- 

tomatic analysers. 

It has been known for a long time that it was erroneous to compute with 
floating-point numbers and operations as though they were on the real field. 
There exist entire treatises discussing the topic of stability in numerical al- 
gorithms from the point of view of the applied mathematician: whether or 
not some algorithm, when implemented with floating-point, will give "good" 
approximations of the real result; we will however not discuss such issues in 
this paper. The purpose of this paper is to show the kind of difficulties that 
floating-point computations pose for static analysis and program testing meth- 
ods: both for defining the semantics of the programs to be analysed, and for 
defining and implementing the analysis techniques. We shall not discuss "bugs" 
in floating-point implementations in microprocessors^ but, rather, how misun- 
derstandings and non-intuitive behaviours of correct hardware implementations 
affect the safety of programs and program analysis techniques. 

*Thc author is now at CNRS / VERIMAG, Grenoble. 

1 William Kahan has interesting cases, see his short paper Beastly Numbers. 
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Many of the issues that we discuss here are known to floating-point arith- 
metic experts. However, they have often been ignored or misunderstood by de- 
signers of programming languages, compilers, verification techniques and pro- 
gram analysers. Some of them were taken care of, at least partially, in the 
definition of Java and the latest standard for C, as well as modern hardware 
implementations. Our primary objective, however, is to educate the designers 
of verification systems, who in general do not have the luxury to change the 
hardware implementation, the programming language, the compilers used or 
the algorithms of the systems that they have to verify. Some of the issues we 
discuss pertain to the Intel™ 386, i86, Pentium™ lines (IA32™ architecture), 
which will most probably continue to be used for embedded work for a number 
of years. Because of this focus on program analysis for critical systems, we shall 
be particularly interested in programs written in the C programming language, 
because this language is often favoured for embedded systems. We shall in par- 
ticular discuss som e implications of the most recent standard of that language, 
"C99" poLllflflflj . In order to emphasise that the issues we discuss are real, we 



shall give specific program fragments as well as versions of compilers, libraries 
and operating systems with which "surprising" behaviours were witnessed. 

All current major microprocessor archit ectur e s (IA 32, x86_64, PowerPC) 
support IEEE-7 54 floating- point arithmetic [IEE, Il985l |. now also an interna- 



tional standard IECl . 1989} : microcontrollers based on these architectures also 



do so. Other microcontrollers and microprocessors often implement some variant 
of it with some features omitted. The specification for the default floating-point 
arithmetic in the Java programming language is also based on IEEE-754. For 
these reasons, we shall focus on this standard and its main implementations, 
though some of our remarks apply to other kinds of systems. For the sake of a 
better understanding, in Section[2j we recall the basics of IEEE-754 arithmetic. 

Despite, or perhaps because of, the prevalence of "IEEE-compliant" systems, 
there exist a number of myths of what IEEE-compliance really entails from the 
point of view of program semantics. We shall discuss the following myths, among 
others: 

• "Since C's float (resp. double) type is mapped to IEEE single (resp. 
double) precision arithmetic, arithmetic operations have a uniquely de- 
fined meaning across platforms." 

• "Arithmetic operations are deterministic; that is, if I do z=x+y in two 
places in the same program and my program never touches x and y in the 
meantime, then the results should be the same." 

• A variant: "If x < 1 tests true at one point, then x < 1 stays true later 
if I never modify x." 

• "The same program, strictly compliant with the C standard with no "un- 
defined behaviours" , should yield identical results if compiled on the same 
IEEE-compliant platform by different compliant compilers." 

A well-known cause for such unintuitive discrepancies is the 80-bit internal 
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floating-point registers on the Intel platform. [Sunl . 200 ll . Appendix D] In Sec- 
tion [3TT1 we shall expand on such issues and show, for instan ce, how low-level 
issues such as register allocation (Appel and Ginsburd . Il997l chapter 11] and 
the insertion of logging instructions with no "apparent" computational effects 
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can change the hnal results of computations. In Section 13.21 we shall discuss 
issues pertaining to the PowerPC architecture. 

An important factor throughout the discussion is that it is not the hardware 
platform that matters in itself, but its combination with the software context, 
including the compiler, libraries, and possible run-time environment. Compat- 
ibility has to be appreciated at the level of the application writer — how code 
written using types mapped to IEEE normalised format s will effect i vely behave 
when compiled and run. Indeed, the standard recalls [lECl . Il989l . llEEl Il985l 
§1.1]: 

It is the environment the programmer or user of the system sees that 
conforms or fails to conform to this standard. Hardware components that 
require software support to conform shall not be said to conform apart 
from such software. 

IEEE-754 standardises a few basic operations; however, many programs use 
functions such as sine, cosine, . . . , which are not specified by this standard and 
are generally not strictly specified in the system documentation. In Section UJ 
we shall explore some difficulties with respect to mathematical libraries. In 
addition to issues related to certain floating-point implementations, or certain 
mathematical libraries, there are issues more particularly related to the C pro- 
gramming language, its compilers and libraries. Section r4.3l explores such system 
dependencies. Section ^. 4l explores issues with input and output of floating-point 
values. 

A commonly held opinion is that whatever the discrepancies, they will be 
negligible enough and should not have noticeable consequences. In Section [5j 
we give a complete example of some seemingly innocuous floating-point code 
fragment based on real-life industrial code. We illustrate how the floating-point 
"oddities" that we explained in the preceding sections can lead to rare and 
extremely hard to diagnose run-time errors. 

While the focus of the paper is on the C programming language, in Section[6] 
we shall discuss a few aspects of the compilation of Java, a language reputedly 
more "predictable" , which many advocate for use in embedded systems. 

While in most of the paper we dismiss some incorrectly optimistic beliefs 
about how floating-point computations behave and what one can safely sup- 
pose about them for program analysis, an opposite misconception exists: that 
floating-point is inherently so complex and tricky that it is impossible to do any 
kind of sound analysis, or do any kind of sound reasoning, on programs using 
floating-point, except perhaps in very simple cases. By sound analysis (Sec. 17.2]) . 
we mean that when one analyses the program in order to prove properties (say, 
"variable x does not exceed 42), using some automated, semi-automated or 
manual proof techniques, then the results that are obtained truly hold of the 
concrete system (e.g. one does not prove the above statement when in reality 
x can reach 42.000000001). The Astree static analysed implements mathe- 
matically sound analyses by taking into account some "error bounds" derived 



2 Astree is a static analyser specialised on a subset of C suitable for many critical ap- 
plications. It features specific, sound and precise analyses for floating-point digital filters. 
It was successfully applied to several classe s of critical code, especially fly-by-wire software . 
See |http : // www. astree . ens . f r as well as Blanchct ct al., 2002, 2003, Cousot ct al., 2005]. 
Simply put, Astree takes as input the source code of a program, a specification of bounds 
on program inputs, and computes, symbolically, a super-set of reachable program states and 
possible program executions, from which it extracts properties interesting to the user, and 
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from the specification of the concrete floating-point semantics. The existence 
of Astree and its use in an industrial context demonstrate that it is possible 
to obtain sound analyses with a reasonable cost, at least for some classes of 
applications. We shall therefore refer to this tool as a practical example further 
on. 

In Section [7] we analyse the consequences of the issues discussed in previous 
sections on abstract interpretation-based static analysis and other validation 
techniques, and show how to obtain sound results. 



2 IEEE- 754: a reminder 

All current general-purpose microprocessors, and many microcontrollers, imple- 
ment hardware floating-point as a variant of standard A NSI/IEEE - 754 (iEeI . 
Il985j . later adopted as international standard IEC-60559 [lECl . ll989l ]. We thus 



begin by an overview of this standard. We shall not, however, describe how algo- 
rithms can make clever use of the standard for numerical computation purposes, 
and refer the reader to papers and presentations from e.g. William Kaharjf] on 
such issues. 



2.1 Numbers 

IEEE floating point numbers are of the following kinds: 

Zeroes There exist both a +0 and a —0. Most operations behave identically 
regardless of the sign of the zero, however one gets different results if one 
extracts the sign bit from the number, or if one divides a nonzero number 
by a zero (the sign of the zero and that of the nonzero other operand 
determine whether +oo or — oo is returned). 

Many programs do not depend on this feature of IEEE arithmetic and 
would behave identically if only a single zero was provided. This is the 
case of most programs implementing real (as opposed to complex) com- 
putations. However, discriminating between +0 and —0 allows for some 
simpler implementations of certain classes of algorithms. 

An i mportant exam ple of the use for separate ±0 is complex arith- 
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Kahanl . 11987] : if branch cuts are located along the real or complex 
axes, then distinguishing +0 and —0 allow making functions continuous 
from both sides of the slit, while having a single zero value introduces a 
discontinuity. As an example, consider Silog(z) with z — x + iy, x < 0: 
lim y ^ + 31og(x + iy) = +ir and lim^rj- 3dog(a; + iy) = — tt, and thus 
it makes sense to define by continuity from both sides: log(x + Oi) = +n 
and \og(x — 0i) — ~ir. This behaviour o f complex a rithmetic along branch 



cuts is mandated by the C99 standard [iSOl . (l999l §7.3.3]. 



Infinities Infinities are generated by divisions by zero or by overflow (compu- 
tations of numbers of such a large magnitude that they cannot be repre- 
sented) . 



proves that certain undesirable behaviours (overflow, array access out of bounds, bad pointer 
dereference, violation of assertion) are impossible. 
a http : //www. cs .berkeley . edu/~wkahan/ 
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NaNs The special values Not a Number (NaN) represent the result of opera- 
tions that cannot have a meaningful result in terms of a finite number or 



infinity. Such is for instance the case of (+00) — (+00), 0/0 or \J — 1. 



Normal numbers (Also known as normalised numbers.) These are the most 
common kind of nonzero representable reals. 

Subnormal numbers (Also known as denormalised numbers or denormals.) 
These represent some values very close to zero. They pose special issues 
regarding rounding errors. 

IEEE specifies 5 possible kinds of exceptions. These exceptions can, at the 
request of the programmer, be substituted for "silent" default responses: 

Invalid operation This is the exception corresponding to the conditions pro- 
ducing a NaN as silent response. 

Overflow This exception is raised when the result of an operation is a number 
too large in magnitude to be represented in the data type. The silent 
response is to generate ±00. 

Division by zero Raised by xj ± where x 7^ 0. The default response is to 
produce ±00, depending on the sign of x and of ±0. 

Underflow This exception is raised when a result is too small in magnitude to 
be computed accurately. This is generally harmless in programs; however, 
care must be taken when computing error bounds on floating-point com- 
putations, because error bounds on underflow differ from those on normal 
computations (see below). 

Inexact The "real" result of a computation cannot be exactly represented by a 
floating-point number. The silent response is to round the number, which 
is a behaviour that the vast majority of programs using floating-point 
numbers rely upon. However, rounding has to be correctly taking into 
account for sound analysis. 

In typical critical embedded critical programs, invalid operations, divisions 
by zero, and overflows are undesirable conditions. In such systems, inputs vary 
within controlled ranges, many variables model physical quantities (which also 
should lie within some reasonable range), and thus NaNs and ±00 should not 
appear. This is especially true since infinities may generate NaNs down the line 
(e.g. (+00) — (+00) = NaN), NaNs propagate throughout the algorithms, and 
converting a NaN to an output quantity (conversion to integer after scaling, for 
instance) yields an undefined result. For this reason, in most critical embedded 
systems, the generation of a NaN or infinity is an undesirable condition, which 
the designers of the system will want to rule out, including through formal 
methods (see Sec. [7]). The invalid operation, overflow and division by zero 
exceptions may be activated, so that they are raised as soon as an undesirable 
result is computed. They may trigger the termination of the program, or the 
running of a "degraded" version of it with limited functionality or complexity. 
This is why one of the features of analysers such as Astree is to detect where 
in programs such exceptions may be raised. In order to achieve this goal, the 
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analyser will have to bound variables and results of floating-point operations, 
which requires some sound analysis techniques and motivates this paper. 

Floating point numbers are represented as follows: x — ±s.m where 1 < 
m < 2 is the mantissa or significand, which has a fixed number p of bits, and 
s = 2 e the scaling factor (E m i n < e < E max is the exponent). The difference 
introduced by changing the last binary digit of the mantissa is ±s.£i as t where 
£i as t = 2~( p-1 ): the unit in the last place or ulp. Such a decomposition is unique 
for a given number if we impose that the leftmost digit of the mantissa is 1 — 
this is called a normalised representation. Except in the case of numbers of very 
small magnitude, IEEE-754 always works with normalised representations. 

T he IEEE-7 54 single precision type, generally associated with C's float 
type jlSOlll999L §F.2], has p = 24, E mili = -126, E max = +127. The IEEE-754 



single precision type, associated to C's double, has p = 53, E m - ln — —1022, 
£ max = +1023. 

We thus obtain normalised floating-point representations of the form: 

x = ±[l.mi . . .m p _i] 2 .2 e , (1) 

noting [uot]2 the representation of a number in terms of binary digits vvv. 

Conversion to and from decimal representation is delicate; special 
care must be taken in ord e r not to intr oduce inaccuracies or discrepan- 
cies. [Steele and Whitd . Il990l IClingeit Il990! . Beca use o f this, C99 introduces 
hexadecimal floating-point literals in source code. ISO . 19991 §6.4.4.2] Their 
syntax is as follows: Oxmmmmmm . mmmnnp± ee where mmmmmm.mmmm is a mantissa in 
hexadecimal, possibly containing a point, and ee is an exponent, written in deci- 
mal, possibly preceded by a sign. They are interpreted as [mmmmmm . mmmm\iQ x 2 ee . 
Hexadecimal floating-point representations are especially important when val- 
ues must be represented exactly, for reproducible results — for instance, for 
testing "borderline cases" in algorithms. For this reason, we shall use them in 
this paper wherever it is important to specify exact values. See also Section 
for more information on inputting and outputting floating-point values. 



2.2 Rounding 

Each real number x is mapped to a floating-point value r(x) by a uniquely de- 
fined rounding function; the choice of this function is determined by the rounding 
mode. 

Generally, floating-point units and libraries allow the program to change 
the current rounding mode; C99 mandates fegetroundO and f esetroundO 
to respectively get and set the current rounding mode on IEEE-compliant plat- 
forms. Other platforms (BSD etc.) provide fpgetroundO and f psetroundO . 
Finally, on some other platforms, it may be necessary to use assembly language 
instructions that directly change the mode bits inside the floating-point unito 
IEEE-754 mandates four standard rounding modes: 

• Round-to-+oo (directed rounding to +oo): r(x) is the least floating point 
value greater than or equal to x. 

4 The Astree static analyser, which uses directed rounding internally as described in 
Sec. 17.51 contains a module that strives to provide a way of changing rounding modes on 
most current, common platforms. We regret the uneven support of the standardised func- 
tions. 
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• Round-to — oo (directed rounding to — oo): r(x) is the greatest floating 
point value smaller than or equal to x. 

• Round-to-0: r{x) is the floating-point value of the same sign as x such 
that \r(x)\ is the greatest floating point value smaller than or equal to \x\. 

• Round-to- nearest: r{x) is the floating-point value closest to x with the 
usual distance; if two floating-point value are equally close to x, then r(x) 
is the one whose least significant bit is equal to zero. Rounding to nearest 
with different precisions twice in a row (say, first to double precision, then 
to single precision) may yield different results than rounding directly to 
the final type; this is known as double rounding. Some implications of 
double-rounding are investigated in Sec. 13.1.21 

The default mode is round-to-nearest, and this is the only one used in the 
vast majority of programs. 

The discussion on floating-point errors frequently refers to the notion of "unit 
in the last place" o r "ulp" ; but th ere are different definitions of this notion, with 



subtle differences [Mullen , [2005J . If the range of exponents were unbounded, 
there would exist a positive real e rc i such that, for all x, \x — r(x)\ < e Ic i.\x\. 
This relative error characterisation is used in some papers analysing the impact 
of roundoff errors. It is, however, incorrect in the context of IEEE-754, where the 
range of exponents is bounded: not only there is the possibility of overflow, but 
there exists a least positive represent able number, which int roduces an absolute 
error for values very close to zero [Ming l2004aL §7.4.3] [Ming , [2004b| . As a 
result, for any floating-point type with a bounded number of bits, there exist 
two positive reals e ro i and e a b s such that 

\x - r(x)\ < max(£ re i.|x|,£ a b 8 ) (2) 

The following, coarser, property may be easier to use in some contexts: 

| a; - r(x)\ < e Ic i.\x\ + e abs (3) 

2.3 Operations 

IEEE-754 standardises 5 operations: addition (which we shall note © in order to 
distinguish it from the operation over the reals), subtraction (©), multiplication 

((g)), division (0), and also square root. 

IEEE-754 specifies exact rounding [Goldbergl - 1991 , §1.5]: the result of a 



floating-point operation is the same as if the operation were performed on the 
real numbers with the given inputs, then rounded according to the rules in the 
preceding section. Thus, x © y is defined as r(x + y), with x and y taken as 
elements of R U {— oo, +oo}; the same applies for the other operators. 

One difficulty, when reasoning about floating-point computations, both with 
human and automated reasoning, is that floating-point operations "almost" 
behave like their name-sakes on the reals, but not quite exactly For instance, it 
is well-known that floating-point operations are not associative (e.g (10 20 ©1)0 
10 20 = ^ 1 = (10 20 © 10 20 ) © 1 using IEEE-754 double precision operations). 
Many symbolic computation techniques, used in compiler optimisers, program 
analysers or proof assistants, assume some good algebraic properties of the 
arithmetic in order to be sound; thus, if applied directly and straightforwardly 
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on floating-point operation, they are unsound. Yet, some compilers rely on 
such properties to perform optimisations (see I4.3.2|) . In Section 17.51 we shall 
explain how it is possible to make such analysis methods sound with respect to 
floating-point. 



3 Architecture-dependent issues 

Some commonplace architectures, or, more appropriately, some commonplace 
ways of compiling programs and using floating-point programs on certain com- 
monplace architectures, can introduce subtle inconsistencies between program 
executions. 



3.1 IA32, x86_64 architectures 

The IA32 architecture, originating from Intel, encompasses processors such as 
the i386, i486 and the various Pentium variants. It was until recently the most 
common architecture for personal computers, but has since been superseded 
for that usage by the x86_64 architecture, a 64-bit extension of IA320 IA32, 
however, is a very commonplace architecture for embedded systems, including 
with embedded variants of the i386 and i486 processors. IA32 offers almost 
complete upward compatibility from the 8086 processor, first released in 1978; it 
features a floating-point unit, often nicknamed x87, mostly upwardly compatible 
with the 8087 coprocessor, first released in 1980. 

Later, another floating-point unit, known as SSE, was added to the archi- 
tecture, with full support for IEEE-754 starting from the Pentium 4 processor; 
it is now the preferred unit to use. The use of the x87 unit is deprecated on 
x86_64 — for instance, the popular gcc compiler does not use it by default 
on this platform, and the documentation for Microsoft Visual Studio C++ on 
x86_64 calls this unit "legacy floating-point" . However, microcontrollers and 
embedded microprocessors are likely not to include this SSE unit in the years 
to come, even if they include x87 hardware floating-point. 



3.1.1 x87 floating-point unit 

Processors of the IA32 architecture (Intel 386, 486, Penti um etc. a nd com- 
patibles) feature a floating-point unit often known as "x87" Int, 20051 chapter 



It supports the floating-point, integer, and packed BCD integer data 
types and the floating-point processing algorithms and exception handling 
architecture defined in the IEEE Standard 754 f or Binary Floating-Point 
Arithmetic. 

This unit has 80-bit registers in "double extended" format (64-bit mantissa 
and 15-bit exponent), often associated to the long double C type; IEEE-754 
specifies the possibility of such a format. The unit can read and write data 



5 AMD made a 64-bit architecture called AMD64™. Intel then produced a mostly com- 
patible architecture, calling it Intel® 64 (not to be confused with IA64™, another, incom- 
patible, 64-bit architecture from Intel, found in the Itanium™ processor), after briefly calling 
it EM64T™. Microsoft Windows documentation calls these architectures x64. We chose a 
middle ground and followed the terminology of GNU/Linux distributions (x86_64). 
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to memory in this 80-bit format or in standard IEEE-754 single and double 
precision. 

By default, all operations performed on CPU registers are done with 64-bit 
precision, but it is possible to reduce precision to 24-bit (same as IEEE sin- 
gle precision) and 53-bit (same as IEE E double p recision) mantissas by setting 
some bits in the unit's control register. SqqE §8.1.5.2] These precision set- 



tings, however, do not affect the range of exponents available, and only affect a 
limited subset of operations (containing all operations specified in IEEE-754) . 
As a result, changing these precisions settings will not result in floating-point 
operations being performed in strict IEEE-754 single or double precision^ 

The most usual way of generating code for the IA32 is to hold temporaries — 
and, in optimised code, program variables — in the x87 registers. Doing so yields 
more compact and efficient code than always storing register values into memory 
and reloading them. However, it is not always possible to do everything inside 
registers, and compilers then general ly spill extra temporary values to main 
memory, Appel and Ginsburgl 1997L chapter 11] using the format associated 



to the type associated to the value by the typing rules of the language. For 
instance, a double temporary will be spilt to a 64-bit double precision memory 
cell. This means that the final result of the computations depend on how the 
compiler allocates registers, since temporaries (and possibly variables) will incur 

or not incur rounding whether or not they are spilt to main memory. 

As an example, the following program compiled with gcc 4.0.1 Frel . 2005b] 
under Linux will print 10 308 (1E308): 

double v = 1E308; 
double x = (v * v) / v; 
printfdg 7„d\n\ x, x==v) ; 

How is that possible? v * v done in double precision will overflow, and thus 
yield +00, and the final result should be +00. However, since all computations 
are performed in extended precision, with a larger exponent range, the compu- 
tations do not overflow. If we use the -f float-store option, which forces gcc 
to store floating-point variables in memory, we obtain +00. 

The result of computations can actually depend on compilation options or 
compiler versions, or anything that affects propagation. With the same compiler 
and system, the following program prints 10 308 (when compiled in optimised 
mode (-0), while it prints +00 when compiled in default mode. 

double f 00 (double v) { 

double y = v * v; 
return (y / v) ; 

} 



mainO { printf 07.g\n" , f 00 (1E308) ) ; } 

Examination of the assembly code shows that when optimising, the compiler 
reuses the value of y stored in a register, while it saves and reloads y to and 
from main memory in non-optimised mode. 

6 By "strict IEEE-754 behaviour", "strict IEEE-754 single precision" or "strict IEEE-754 
double precision" , we mean that each individual basic arithmetic operation is performed as if 
the computation were done over the real numbers, then the result rounded to single or double 
precision. 
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A common optimisation is inlining — that is, replacing a call to a func- 
tion by the expansion of the code of the function at the point of call. For 
simple functions (such as small arithmetic operations, e.g. x i— ► x 2 ), this can 
increase performance significantly, since function calls induce costs (saving reg- 
isters, pass ing parameters, performing the call, handling return values). C99 
ISO . 19991 . §6.7.4] and C++ have an inline keyword in order to pinpoint func- 



tions that should be inlined (however, compilers are free to inline or not to 
inline such functions; they may also inline other functions when it is safe to 
do so). However, on x87, whether or not inlining is performed may change the 
semantics of the code! 

Consider what gcc 4.0.1 on IA32 does with the following program, depending 
on whether the optimisation switch -0 is passed: 

static inline double f (double x) { 
return x/lE308; 

} 



double square (double x) { 
double y = x*x; 
return y; 

} 



int main (void) { 

printf ("7„g\n", f (square (1E308))) ; 

} 

gcc does not inline functions when optimisation is turned off. The square 
function returns a double, but the calling convention is to return floating point 
values into a x87 register — thus in long double format. Thus, when square 
is called, it returns approximately 10 , which fits in long double but not 
double format. But when f is called, the parameter is passed on the stack — 
thus as a double, +oo. The program therefore prints +oo. In comparison, if the 
program is compiled with optimisation on, f is inlined; no parameter passing 
takes place, thus no conversion to double before division, and thus the final 
result printed is 10 308 . 

It is somewhat common for beginners to add a comparison check to before 
computing a division, in order to avoid possible division-by-zero exceptions or 
the generation of infinite results. A first objection to this practise is that, any- 
way, computing 1/x for x very close to zero will generate very large numbers 
that will most probably result in overflows later. Indeed, programmers lack- 
ing experience with floating-point are advised that they should hardly ever use 
strict comparison tests (=, 7^, < and > as opposed to < and >) with floating- 
point operands, as it is somewhat pointless to exclude some singularity point by 
excluding one single value, since it will anyway be surrounded by mathematical 
instability or at least very large values which will break the algorithms. Another 
objection, which few programmers know about and that we wish to draw at- 
tention to, is that it may actually fail to work, depending on what the compiler 
does — that is, the program may actually test that x ^ 0, then, further down, 
find that x = without any apparent change to x. How can this be possible? 

Consider the following source code (see Section 12.11 for the meaning of hex- 
adecimal floating-point constants): 
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/* zero_nonzero . c */ 

void do_nothing (double *x) { } 

int main (void) { 

double x = Oxlp-1022, y = OxlplOO, z; 
do_nothing(&y) ; 
z = x / y; 
if (z != 0) { 

do_nothing(&z) ; 

assert (z ! = 0) ; 

} 

} 

This program exhibits different behaviours depending on various factors, 
even when one uses the same compiler (gcc version 4.0.2 on IA32): 

• If it is compiled without optimisation, x / y is computed as a long 
double then converted into a IEEE-754 double precision number (0) in 
order to be saved into memory variable z, which is then reloaded from 
memory for the test. The if statement is thus not taken. 

• If it is compiled as a single source code with optimisation, gcc performs 
some kind of global analysis which understands that do_nothing does 
nothing. Then, it does constant propagation, sees that z is 0, thus that 
the if statement is not taken, and finally that mainO performs no side 
effect. It then effectively compiles mainO as a "no operation". 

• If it is compiled as two source codes (one for each function) with opti- 
misation, gcc is not able to use information about what do_nothing() 
does when compiling main ( ) . It will thus generate two function calls 
to do_nothing() , and will not assume that the value of y (respectively, 
z) is conserved across do_nothing(&y) (respectively, do_nothing(&z)). 
The z ! = test is performed on a nonzero long double quantity and 
thus the test is taken. However, after the do_nothing(&z) function call, 
z is reloaded from main memory as the value (because conversion to 
double-precision flushed it to 0). As a consequence, the final assertion 
fails, somehow contrary to what many programmers would expect. 

• With the same compilation setup as the last case, removing 
do_nothing(&z) results in the assertion being true: z is then not flushed 
to memory and thus kept as an extended precision nonzero floating-point 
value. 

One should therefore be extra careful with strict comparisons, because the 
comparison may be performed on extended precision values, and fail to hold later 
after the values have been converted to single or double precision — which may 
happen or not depending on a variety of factors including compiler optimisations 
and "no-operation" statements. 

We are surprised by these discrepancies. After all, the C specification says 
IS0Lll999l 5 .1.2.3, program execution, §12, ex. 4]: 



Implementations employing wide registers have to take care to hon- 
our appropriate semantics. Values are independent of whether they are 
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represented in a register or in memory. For example, an implicit spilling 
of a register is not permitted to alter the value. Also, an explicit store 
and load is required to round to the precision of the storage type. 



However, this paragraph, being an example, is not normative. ISO . 19991 fore- 



word, §6]. By reading the C specification more generally, one gets the impression 
that such hidden side-effects ("hidden", that is, not corresponding to program 
statements) are prohibited. 

The above examples indicate that common debugging practises that appar- 
ently should not change the computational semantics may actually alter the 
result of computations. Adding a logging statement in the middle of a compu- 
tation may alter the scheduling of registers, for instance by forcing some value to 
be spilt into main memory and thus undergo additional rounding. As an exam- 
ple, simply inserting a printf ("°/g\n" , y) ; call after the computation of y in 
the above square function forces y to be spilt to memory, and thus the final re- 
sult then becomes +00 regardless of optimisation. Similarly, our do_nothing() 
function may be understood as a place-holder for logging or debugging state- 
ments which are not supposed to affect the state of the variables. 

In addition, it is commonplace to disable optimisation when one intends to 
use a software debugger, because in optimised code, the compiled code corre- 
sponding to distinct statements may become fused, variables may not reside 
in a well-defined location, etc. However, as we have seen, simply disabling or 
enabling optimisation may change computational results. 

3.1.2 Double rounding 

In some circumstances, floating-point results are rounded twice in a row, first 
to a type A then to a type B. Surprisingly, such double rounding can yield 
different results from direct rounding to the destination type0 Such is the case, 
for instance, of results computed in the long double 80-bit type of the x87 
floating-point registers, then rounded to the IEEE double precision type for 
storage in memory. In round-to-0, round-to— (-00 and round-to — 00 modes, this 
is not a problem provided that the values representable by type B are a subset 
of those representable by type A. However, in round-to-nearest mode, there 
exist some borderline cases where differences are exhibited. 

In order to define the round-to-nearest mode, one has to define arbitrarily 
how to round a real exactly in the mid dle between the neares t floating-point 
values. IEEE- 754 chooses round-to-even [iECt Il989l IIeeI Il985l §4.1] H 



In this mode, the representable value nearest to the infinitely precise 
result shall be delivered; if the two nearest representable values are equally 
near, the one with its least significant bit equal to zero shall be delivered. 

This definition makes it possible for double rounding to yield different results 
than single rounding to the destination type. Consider a floating-point type B 
where two consecutive values are xq and xq + Sb, and another floating- type A 
containing all values in B and also xo + <$b/2. There exists Sa > such that all 
reals in the interval I = (xq + Sb/2 — S a/ '2, x + <5g/2) get rounded to xq + Sb /2 



7 This problem h as been known for a long time. Fieucroa del Cid], |2000| . chap- 
ter 6llGoldbergl|l99ll . 4.2] 

8 [Goldberd . 11991. 1.5] explains a rationale for this. 
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when mapped to type A. We shall suppose that the mantissa of xo finishes by 
a 1. If x € /, then indirect rounding yields: x (xq + <5b/2) ^b (%o + $b) 
and direct rounding yields: x —>b xq. 

A practical example is with xq = 1 + 2~ 52 , S = 2~ 52 and r = xo + y where 
y = (<5/2)(l — 2~ n ). Both xo and y are exactly representable in IEEE-754 
double precision (B). 

double xO = 0x1 . OOOOOOOOOOOOlpO; 
double y = Oxlp-53 * (1. - Oxlp-11) ; 
double zl = xO + y; 

double z2 = (long double) xO + (long double) y; 
printf ("°/.a °/.a\n" , zl, z2) ; 

In order to get strict IEEE-754 double precision computations for the double 
type, we execute double-precision computations on the SSE unit (see Sec- 
tion [2HI3I) of an x86_64 or Pentium 4 processor. We then obtain that Z\ = xq 
and that z-i — xq + 2~ 52 : both Z\ and Z2 are double values obtained by ap- 
parently identical computations (the sum of xq and y), but the value that z-i 
holds will have been doubly rounded (first to extended precision, then to double 
precision for storing to z%) while Z\ holds a value directly rounded to double 
precision. 

Note that, on IA32, depending on compilation modes, the above discrepancy 
may disappear, with both values undergoing double rounding: on IA32 Linux 
/ gcc-4.0.1 with default options, the computations on double will be actually 
performed in the long double type inside the x87 unit, then converted to IEEE 
double precision. There is thus no difference between the formulae computing 
z\ and Zi. 

Another example was reported as a bug by a user who noticed inconsistencies 
between 387 and SSE but did not identify the source of the problem: 

double b = 0x1 .f ff ffff ffffffp-1; 
double x = 1 / b; 

Let s = 2~ 53 , then b = 1 — e. 1/6 = 1 + s + s 2 + . . . ; the nearest double 
precision numbers are 1 and 1 + 2s and thus direct rounding gives x = 1 + 2s. 
However, rounding to extended precision will give 1+e, which is rounded to 1 
when converting to double precision. 

A similar problem occurs with rounding behaviour n ear infinities: see the 
definition of round-to- nearest for large values [iEEl . [l985l . §4.1]: 

However, an infinitely precise result with magnitude at least 2 Emaz {2 — 
2 _p ) shall round to oo with no change in sign. 

For IEEE double-precision, E max — 1023 and p — 53; let us take xq to be the 
greatest representable real, M doub i c = 2^^(2-2-^-^) and y = 2 970 (l-2- n ). 
With a similar program as above, r = xo + y gets rounded to z% = xq in IEEE 
double precision, but gets rounded to 2 Bmax (2 — 2~ p ) in extended precision. As 
a result, the subsequent conversion into IEEE double precision will yield +oo. 

Double rounding can also cause some subtle differences for very small num- 
bers that are rounded into subnormal double-precision values if computed in 
IEEE-754 double precision: if one uses the "double-precision" mode of the x87 
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FPU, these numbers will be rounded into normalised values inside the FPU reg- 
ister, because of a wider range of negative exponents; then they will be rounded 
again into double-precision subnormals when written to memory. This is known 
as double-rounding on underflow. Working around double-rounding on under- 
flow is somewhat tedious (however, the phenomen on is exhibited by x and /, 
not by + and — ). [Java Grande Forum Panel , 1998] A concrete example : taking 



x = 0x1 . 8000000000001p-1018 w 5.34018 x 1CT 307 
y = 0x1 . 0000000000001p+56 « 7.20576 x 10 16 , 

then x0y = 0x0 . 0000000000001p-1022 in strict IEEE-754 double precision and 
x0y = 0x0 . 0000000000002p-1022 with the x87 in "double precision mode" . 

3.1.3 SSE floating-point unit 

Intel introduced the SSE floating-point unit Int, 20051 chapter 10] i n the 



Pentium III processor, then the SSE2 extension in the Pentium 4 jlntl . 120051 
chapter 11]. These extensions to the x86 instruction set contain, respectively, 
IEEE-compatible single-precision and double-precision instructions^ One can 
make gec generate code for the SSE subsystem with the -mf pmath=sse option; 
since SSE is only available for certain processors, it is also necessary to specify, 
for instance, -march=pentium4. On x86_64, -mf pmath=sse is the default, but 
-mf pmath=387 forces the use of the x87 unit. 

Note the implication: the same program may give different results when 
compiled on 32-bit and 64-bit "PCs" (or even the same machine, depending on 
whether one compiles in 32-bit or 64-bit mode) because of the difference in the 
default floating-point subsystem used. 

The differences may seem academi c, but the fo l lowing incident proves they 
are not. The Objective Caml system Lerov et all l2005j includes two compil- 



ers: one compiles Caml code into a portable bytecode, executed by a virtual 
machine (this virtual machine is written in C); the other one compiles directly 
to native assembly code. One user of Objective Caml on the recent Intel-based 
Apple Macintosh computers complained of a mysterious "bug" to the Caml 
maintainers: the same program gave slightly different results across the byte- 
code and native code versions. The problem could not be reproduced on other 
platforms, such as Linux on the same kind of processors. It turned out that, as 
all Intel-based Macintosh machines have a SSE unit, Apple configured gec to 
use the SSE unit by default. As a consequence, on this platform, by default, 
the Objective Caml virtual machine uses the SSE unit when executing bytecode 
performing floating-point computations: for instance, an isolated floating-point 
addition will be performed as a sequence of two loads from double-precision 
operands, addition with the result in a double-precision register, and then save 
to a double-precision variable. The native code compiler, however, uses the x87 
unit: the same floating-point addition is thus performed as a sequence of two 
loads from double-precision operands, addition with the result in an extended- 
precision register, and then save to a double-precision variable. As we pointed 



9 In addition to scalar instructions, SSE and SSE2 introduce various vector instructions, 
that is, instructions operating over several operands in parallel. We shall see in i)4.3.2l that 
compilers may rearrange expressions incorrectly in order to take advantage of these vector 
instructions. For now, we only discuss the scalar instructions. 
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out in the preceding section, these two sequences of operations are not equiv- 
alent in the default round-to-nearest mode, due to double rounding. It turned 
out that the user had stumbled upon a value resulting in double rounding. The 
widespread lack of awareness of floating-point issues resulted in the user blaming 
the discrepancy on a bug in Objective Caml! 

In addition, the SSE unit offers some non-IEEE -754 compliant modes for 
better efficiency: with the flush-to-zero flag (lnt] . l2005l §10.2.3.3] on, subnormals 
are not generated and are replaced by zeroes; this is more efficient. As we 
noted in Section |2~2"1 this does not hamper obtaining good bounds on the errors 
introduced by floating-point computations; also, we can assume the worst-case 
situation and suppose that this flag is on when we derive error bounds. 

The flush-to-zero flag, however, has another notable consequence: x y = 
is no longer equivalent to x = y. As an example, if x = 2 ~ 1022 and y — 
1.5 x 2 -1022 , then yQx — 2 -1023 in normal mode, and yOx — in flush-to-zero 
mode. Analysers should therefore be careful when replacing comparisons by 
"equivalent" comparisons. 

In addition, there exists a denormals- are- zero flag [hul 120051 . §10.2.3.4]: if 



it is on, all subnormal operands are considered to be zero, which improves 
performance. It is still possible to obtain bounds on the errors of floating point 
computations by assuming that operands are offset by an amount of at most 
_|-2 emin- (P -1 ) before being computed upon. However, techniques based on exact 
replays of instruction sequences will have to replay the sequence with the same 
value of the flag. 



3.1.4 Problems and solutions 

We have shown that computations on the float (respectively, double) types 
are not performed on the x87 unit exactly as if each atomic arithmetic operation 
were performed with IEEE-754 single (respectively, double precision) operands 
and result, and that what is actually performed may depend on seemingly irrel- 
evant factors such as calls to tracing or printing functions. This goes against the 
widespread myth that the result of the evaluation of a floating-point expression 
should be the same across all platforms compatible with IEEE-754. 

This discrepancy has long been known to some people in the programming 
language community, and some "fixes" have been proposed. For instance, gcc 
has a -f float-store option, fl ushing float ing-point variables to memory. [Fry, 
2005bl ] Indeed, the gcc manual fFrel . l2005bl ] says: 

On 68000 and x86 systems, for instance, you can get paradoxical re- 
sults if you test the precise values of floating point numbers. For example, 
you can find that a floating point value which is not a NaN is not equal 
to itself. This results from the fact that the floating point registers hold a 
few more bits of precision than fit in a double in memory. Compiled code 
moves values between memory and floating point registers at its conve- 
nience, and moving them into memory truncates them. You can partially 
avoid this problem by using the -f float-store option. 

The manual refers to the following option: 

-f float-store Do not store floating point variables in registers, and 
inhibit other options that might change whether a floating point value is 
taken from a register or memory. 
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This option prevents undesirable excess precision on machines [. , . ] 
where the floating registers [. . . J keep more precision than a 'double ' is 
supposed to have. Similarly for the x86 architecture. For most programs, 
the excess precision does only good, but a few programs rely on the pre- 
cise definition of IEEE floating point, [sic] Use '-f float-store' for such 
programs, after modifying them to store all pertinent intermediate com- 
putations into variables. 

Note that this option does not force unnamed temporaries to be flushed to mem- 
ory, as shown by experiments. To our knowledge, no compiler offers the choice 
to always spill temporaries to memory, or to flush temporaries to long double 
memory, which would at least remove the worst problem, which is the non- 
reproducibility of results depending on factors independent of the computation 
code (register allocation differences caused by compiler options or debugging 
code, etc.). We suggest that compilers should include such options. 

Unfortunately, anyway, systematically flushing values to single- or double- 
precision memory cells do not reconstitute strict IEEE-754 single- or double- 
precision rounding behaviour in round-to-nearest mode, because of the double 
rounding problem (|3.1.2p . In addition, the -f float-store option is difficult 
to use, because it only affects program variables and not temporaries: to ap- 
proximate strict IEEE-754 behaviour, the programmer would have to rewrite 
all program formulae to store temporaries in variables. This does not seem to 
be reasonable for human-written code, but may be possible with automatically 
generated code — it is frequent that control/command appl ications are i mple- 
mented in a high-level language suc h as Simuli nlf^'l Lustre Caspi et al. I. Il987j 
or Scade™E] then compiled into C (iSOl . Il999j ]. 

Another possibility is to force the floating-point unit to limit the width of 
the mantissa to that of IEEE-754 basic formats (single or double precision) F^l 
This mostly solves the double-rounding problem. However, there is no way to 
constrain the range of the exponents, and thus these modes do not allow exact 
simulation of strict computations on IEEE single and double precision formats, 
when overflows and underflows are possible. For instance, the square program 
of Sec. I3.1.1| which results in an overflow to +00 if computations are done on 
IEEE double precision numbers, does not result in overflows if run with the x87 
in double precision mode. Let us note, however, that if a computation never 
results in overflows or underflows when done with IEEE-754 double-precision 
(resp. single-) arithmetic, it can be exactly simulated with the x87 in double- 
precision (resp. single) modeF^I 

If one wants semantics almost exactly faithful to strict IEEE-754 single or 



10 Simulink™is a tool for modelling dynamic systems and control applications, using e.g. 
networks of numeric niters. The control part may then be compiled to hardware or software, 
http: / /www.mathworks . com/products/simulink/ 

ii Scade is a commercial product based on LUSTRE, 
http: / /www . esterel-technologies . com/products/scade- suite/ 

li! This is the default setting on FreeBSD 4, presumably in order to achieve closer compati- 
bility with strict IEEE-754 single or double precision computations. 

13 Let us consider the round-to-nearest case. If |r| < M(j ou bie (where M^ ou ^i c is the greatest 
representable double precision number), then the x87 in double-precision mode rounds exactly 
like IEEE-754 double-precision arithmetic. If M doublc < |r| < 2 E «"™{2-2~P), then, according 
to the round-to-nearest rules (including "round-to-even" for r = 2 Bmax (2 — 2~ p )), r is rounded 
to ±M doublc on the x87, which is correct with respect to IEEE-754. If \r\ > 2 Bmax (2 - 2 _p ), 
then rounding r results in an overflow. The cases for the other rounding modes are simpler. 
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double precision computations in round-to-nearest mode, including with respect 
to overflow and underflow conditions, one can use, at the same time, limitation 
of precision and options and programming style that force operands to be sys- 
tematically written to memory between floating-point operations. This incurs 
some performance loss; furthermore, there will still be slight discrepancy due 
to double rounding on underflow. A simpler solution for current personal com- 
puters is simply to force the compiler to use the SSE unit for computations on 
IEEE-754 types; however, most embedded systems using IA32 microprocessors 
or microcontrollers do not use processors equipped with this unit. 

Yet another solution is to do all computations in long double format. This 
solves all inconsistency issues. However, long double formats are not the same 
between processors and operating systems, thus this workaround leads to porta- 
bility problems. 



3.2 PowerPC architecture 

The floating point operatio ns impleme nted in the PowerPC arch itecture are 
compatible with IEEE-754 [FreL l2001bl §1.2.2.3, §3.2]. However, [Prel . l2001bl 
§4.2.2] also points out that: 

The architecture supports the IEEE-754 floating-point standard, but 
requires software support to conform with that standard. 

T he Pow erPC architecture features floating-point multiply-add instructions 



Frd . l2001bl §4.2.2.2]. These perform (a, 6, c) i— ► ±a.b± c computations in one 
instruct ion — with obvious benefits for computations such as matrix compu- 
Cormen et all Il990, §26 . 11, do t products, or Horner's rule for evalu- 



tations 



ating polynomials Cormen et al. . 199Ct §32.1]. Note, however, that they are 



not semantically equivalent to performing separate addition, multiplication and 
optional negate IEEE-compati ble instruct ions; in fact, intermediate results are 



computed with extra precision [Frd . l2001bl D.2]. Whether these instructions are 



used or not depends on the compiler, optimisation options, and also how the 
compiler subdivides instructions. For instance, gec 3.3 compiles the following 
code using the multiply-add instruction if optimisation f-Dj is turned on, but 
without it if optimisation is off, yielding different semantics! 14 ! 

double dotProduct (double al, double bl, 

double a2, double b2) { 
return al*bl + a2*b2; 

} 

In addition, the fpscr control register has a NI bit, which, if on, possi- 
bly enabl es impleme ntation-dependent semantics different from IEEE-754 se- 
mantics. [Frel . |2001~dI §2.1.4]. This alternate semantics may include behaviours 
disallowed by IEEE-754 regardless of which data formats and precisions are 
used. For instance, on the MPC750 family, such non-compliant behaviour en- 
compasses flushing subnormal result s to zero, ro unding subnormal operands to 
zero, and treating NaNs differently [Frd . l2001aL §2.2.4]. Similar caveats apply 
as in Section [3. 1.31 

14 gcc has an option -mno-f used-madd to turn off the use of this instruction. 
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4 Mathematical functions 



Many operations related to floating-point are not implemented in hardware; 
most programs using floating-point will thus rely on suitable support libraries. 
Our purpose, in this section, is not to comprehensively list bugs in current 
floating-point libraries; it is to illustrate, using examples from common operating 
systems and runtime environments, the kind of problems that may happen. 



4.1 Transcendental functions 

A first remark is that, though IEEE-754 specifies the behaviour of elementary 
operations +, — , x , / and it does not specify the behaviour of other functions, 
including the popular trigonometric functions. These are generally supplied by 
a system library, occasionally by the hardware. 

As an example, consider the sine function. On the x87, it is implemented in 
hardware; on Linux IA32, the GNU libc function sin() is just a wrapper around 
the hardware call, or, depending on compilation opti ons, can b e replaced by 
inline assembly^ Intel (intl . 120051 . §8.3.10] and AMD [Advl . 120051 §6.5.5] claim 
that their transcendental instructions (on recent processors) commit errors less 
than 1 ulp in round-to-nearest mode; however it is to be understood that this is 
after the operands of the trigonometric functions are reduced m odulo 27r, which 
is done currently using a 66-bit approximation for it. (intl . l2005l 58. 3.81 Howeve r, 



the AMD-K5 used up to 256 bits for approximating tt/2. Lynch et al. . 1995| 

One obtains different results for sin(0xl969869861 .p+0) on PCs run- 
ning Linux. The Intel Pentium 4, AMD Athlon64 and AMD Opteron pro- 
cessors, and GNU libc running in 32-bit mode on IA32 or x86_64 all yield 
-0x1 . 95b011554d4b5p-l, However, Mathematica, Sun Sparc under Solaris and 
Linux, and GNU libc on x86_64 (in 64-bit mode) yield -0x1 . 95b0115490ca6p-l. 

A more striking example of discrepancies is sin(p) where p — 14885392687. 
This value was chosen so that sin(p) is close to 0, in order to demonstrate 
the impact of imprecise reduction modulo 2ir\^ Both the Pentium 4 x87 and 
Mathematica yield a result sin(p) rj 1.671 x 10~ 10 . However, GNU libc on 
x86_64 yields sin(p) « 1.4798 x 1CT 10 , about 11.5% off! 

Note, also, that different processors within the same architecture can imple- 
ment the same transcendental functions with different accuracies. We already 
noted the difference between the AMD-K5 and the K6 and following processors 
with respect to angular reduction. Intel also notes that the algorithms' pre- 
cision was i mproved b etween the 80387 / i486DX processors and the Pentium 



processors. 



Intl. 119971 §7.5.10] 



With the Intel4-86 processor and Intel 387 math coprocessor, the worst- 
case, transcendental function error is typically 3 or 3.5 ulps, but is some- 
times as large as 4-5 ulps. 



15 The latter behaviour is triggered by option -f fast-math. The documentation for this 
function says it can result in incorrect output for programs which depend on an exact imple- 
mentation of IEEE or ISO rules/specifications for math functions. 

16 Consider a rational approximation of n, i.e. integers p and q such tha t p/q tt (such a n 
approximation can be obtained by a continued fraction development of 7r [Weisstcin , l2005f | ) . 
sin(p) pa sin(g7r) = 0. If p', the result of reduction modulo 2n of p, is imprecise by a margin of 
e (3k p' — p = e + 2kn), then sin(p') — sin(p) R* e (sin(a;) ~ x close to 0). Such inputs are thus 
good candidates to illustrate possible lack of precision in the algorithm for reduction modulo 
2tt. 
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There thus may be floating-point discrepancies between the current Intel cm- 
bedded processors (based on i386 / i387) and the current Pentium workstation 
processors. 

To summarise, one should not expect consistent behaviour of transcendental 
functions across libraries, processor manufacturers or models, although recent 
developments such as the MPFR librarjf^ provide exactly rounded transcen- 
dental functions. In any case, static analysis tools should never assume that the 
libraries on the analysis host platform behave identically to those on the target 
platform, nor, unless it is specified in the documentation, that they fulfil prop- 
erties such as monotonicity (on intervals where the corresponding mathematical 
function is monotonic). 

4.2 Non-default rounding modes 

We also have found that floating-point libraries are often poorly tested in "un- 
common" usage conditions, such as rounding modes different from "round-to- 
nearest" ; or, perhaps, that they are not supposed to work in such modes, but 
that this fact is not reflected adequately in documentation. This is especially 
of interest for implementers of static analysers (Section l7.5|) . since some form of 
interval arithmetic will almost certainly be used in such software. 

FreeBSD 4.4 provides the fpsetroundO function to set the rounding mode 
of the processor. This function is the BSD counterpart of the C99 f esetroundO 
function. However, the printf () standard printing function of the C library 
does not work properly if the processor is set in round-to-+oo mode: when one 
attempts to print very large values (such as 10 308 ), one can get garbage output 
(such as a colon in a location where a digit should be, e.g. : e+307). 

On GNU libc 2.2.93 on IA32 processors, the f esetroundO function only 
changed the rounding mode of the x87 FPU, while the gcc compiler also offered 
the possibility of compiling for SSE. 

On GNU libc 2.3.3 on x86_64, computing x v using the pow() function in 
round-to-+oo mode can result in a segmentation violation for certain values 
of x and y, e.g. x — 0xl.3d027p+6 and y = 0xl.555p-2. As for the exp() 
exponential function, it gives a result close to 2 502 on input 1, and a negative 
result on input 0xl.75p+0. The problems were corrected in version 2.3.5. 

4.3 Compiler issues 

The C standard is fairly restrictive with respect to what compilers are allowed or 
not allowed to do with floating-point operations. Unfortunately, some compilers 
may fail to comply with the standard. This may create discrepancies between 
what really happens on the one hand, and what users and static analysis tools 
expect on the other hand. 

17 MPFR, available from http://www.mpfr.org, is a library built on top of the popular 
GNU MP multiprecision library. MPFR provides arbitrary precision floating-point arithmetic 
with the same four rounding modes as IEEE- 754. In addition to the basic arithmetic opera- 
tions specified by IEEE- 754, MPFR also provides various arbitrary precision transcendental 
functions with guaranteed rounding. 
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4.3.1 Standard pragmas 

The C standard, by default, allows the compiler some substantial leeway in the 
way that floating-point expressions may be evaluated. While outright simpli- 
fications based on operator associativ ity a r e not permitted, since these can be 
very unsound on floating-point types ISOl 19991 . §5.1.2.3 #13], the compiler is 



for instance permitted to replace a complex expression by a simpler one, for 
instance using comp ound operators (e.g. the fused multiply-and-add in Sec- 



tion 



mg corrn > 
ISQ 11990 . 6.5]: 



A floating expression may be contracted, that is, evaluated as though it 
were an atomic operation, thereby omitting rounding errors implied by the 
source code and the expression evaluation method. [. . . ] This license is 
specifically intended to allow implementations to exploit fast machine in- 
structions that combine multiple C operators. As contractions potentially 
undermine predictability, and can even decrease accuracy for containing 
expressions, their use needs to be well-defined and clearly documented. 

While some desirable properties of contracted expressions (lSOl . ll999L §F.6] arc 
requested, no precise behaviour is made compulsory. 

Because of the inconveniences that discrepancies can create, t he st a ndard 
also mandates a special directive, #PRAGMA STDC FP_CDNTRACT (iSOl . Il999l 



§7.12.2], for controlling whether or not such contractions can be performed. 
Unfortunately, while many compilers will contract expressions if they can, few 
compilers implement this pragma. As of 2007, gcc (v4.1.1) ignores the pragma 
with a warning, and Microsoft's Visual C++ handles it as a recent addition. 

We have explained how, on certain processors such as the x87 (Section fe.l.ip . 
it was possible to change the precision of results by setting special flags — while 
no access to such flags is mandated by the C norm, the possibility of various 
precision modes is acknowledged by the norm [lSQl . ll999l F .7.2]. Furthermore, 



IEEE-754 mandates the availability of various rounding modes (Section | 
in addition, some processors offer further flags that change the behaviour of 
floating-point computations. 

All changes of modes are done through library functions (or inline assembly) 
executed at runtime; at the same time, the C compiler may do some computa- 
tions at compile time, without regard to how these modes are set. 

During translation the IEC 60559 default modes are in effect: The 
rounding direction mode is rounding to nearest. The rounding preci- 
sion mode (if supported) is set so that results are not shortened. Trap- 
ping or stopping (if supported) is disabled on all floating-point exceptions. 
[. . . J The implementation should produce a diagnostic message for each 
translation-time floating-point exception, other than inexact; the imple- 
mentation should then proceed with the translation of the program. 

In addition, programs to be compiled may test or change the floating-point 
status or operating modes using library functions, or even inline assembly. If 
the compiler performs code reorganisations, then some results may end up being 
computed before the applicable rounding modes are set . For this r eason, the C 
norm introduces #pragma STDC FENV_ACCESS ON/OFF [iSCl ll999L §7.6.1]: 



The FENV_ACCESS pragma provides a means to inform the implemen- 
tation when a program might access the floating-point environment to test 
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floating-point status flags or run under non-default floating-point control 
modes. [. , .] If part of a program tests floating-point status flags, sets 
floating-point control modes, or runs under non-default mode settings, 
but was translated with the state for the FENVJiCCESS pragma off, the 
behaviour is undefined. The default state ( on or off) for the pragma is 
implementation- defined. [. . .] The purpose of the FENVJiCCESS pragma 
is to allow certain optimisations that could subvert flag tests and mode 
changes (e.g., global common subexpression elimination, code motion, 
and constant folding). In general, if the state of FENVJICCESS is off, the 
translator can assume that default modes are in effect and the flags are 
not tested. 

Another effect of this pragma is to change how muc h the comp iler can evalu- 
ate at compile time regarding constant initialisations. jlSOlll999l F.7.4, F.7.5]. 



If it is set to OFF, the compiler can evaluate floating-point constants at compile 
time, whereas if they had been evaluated at runtime, they would have resulted 
in different values (because of different rounding modes) or floating-point excep- 
tion. If it is set to ON, the compiler may do so only for static constants — which 
are generally all evaluated at compile time and stored as a block of constants in 
the binary code of the program. 

Unfortunately, as per the preceding pragma, most compilers do not recog- 
nise this pragma. There may, though, be some compilation options that have 
some of the same effect. Again, the user should carefully read the compiler 
documentation. 

4.3.2 Optimisations and associativity 

Some optimising compilers will apply rules such as associativity, which may 
significantly alter the outcome of an algor ithm , and thus are not allowed to 
apply according to the language standard. ISOl . 1999L §5.1.2.3] 



A particularly interesting application of such permissiveness is vectorisation; 
that is, using the features of certain processors (such as IA32, x86_64 or EM64T 
processors, with a SSE unit) that enable doing the same mathematical operation 
on several operands in a single instruction. Take for instance the following 
program, which sums an array: 

double s = 0.0; 
for(int i=0; i<n; i++) { 
s = s + t [i] ; 

} 

This code is not immediately vectorisable. However, assuming that n is a 
multiple of two and that addition is associative, one may rewrite this code as 
follows: 

double sa[2] , s; sa[0]=sa[l]= 0.0; 
for(int i=0; i<n/2; i++) { 

sa[0] = sa[0] + t[i*2+0]; 

sa[l] = sa[l] + t[i*2+l]; 

} 

s = sa[0] + sa[l] ; 
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That is, we sum separately the elements of the array with even or odd indexes. 
Depending on other conditions such as memory alignment, this code may be 
immediately vectorised on processors that can do two simultaneous double pre- 
cision operations with one single vector instruction. 

If we compile the first code fragment above using Intel's ice compiler with 
options -xW -02 (optimise for Pentium 4 and compatible processors), we see 
that the loop has been automatically vectorised into something resembling the 
second fragment. This is because, by default, ice uses a "relaxed" confor- 
mance mode with respect to floating-point; if one specifies -fp-model precise, 
then the application of associativity is prohibited and the compiler says "loop 
was not vectorised: modifying order of reduction not allowed under given 
switches". Beta versions of gec 4.2 do not vectorise this loop when using 
-03 -f tree-vectorize, but they will do so on appropriate platforms with op- 
tion -f fast-math or the aptly named -f unsafe-math-optimisations. 

Another class of optimisations involves the assumption, by the compiler, 
that certain values (±oo, NaN) do not occur in regular computations, or that 
the distinction between ±0 does not matter. This reflects the needs of most 
computations, but may be inappropriate in some contexts; for instanc e, some 
comp utations on complex numbers use the fact that ±0 are distinct Kahanl . 
Il987l |. It is thus regrettable, besides being contrary the C standard, that some 
compilers, such as ice, choose, by default, to assume that NaNs should not be 
handled correctly, or that ±0 can be used interchangeably. We shall see this on 
a simple example. 

Consider the problem of computing the minimum of two floating-point num- 
bers. This can be implemented in four straightforward ways: 

1. x < y ? x : y 

2. x <= y ? x : y 

3. x > y ? y : x 

4. x >= y ? y : x 

These four expressions are equivalent over the real numbers, but they are not if 
one takes NaNs into account, or one differentiates ±0. Witness: 
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y 


x<y ? x:y 


x<=y ? x:y 


x>y ? y:x 


x>=y ? y:x 
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On SSE (see I3.1.3|) , both ice, by default, and gec with the -02 
-f fast-math option will compile all four expressions as though they were the 
first one. The reason is that the first expression maps directly to one assembly 
instruction, minss or minsd, while the others entail more complex and slower 
code. Interestingly, if the four above expression are present in the same function, 
gec -02 -f fast-math will detect that they are equivalent and simply reuse the 
same result. 

We echo William KaharF^I in deploring that some compilers allow them- 
selves, by default, to ignore language standards and apply unsafe optimisations, 



18 See for instance the essay The baleful influence of SPEC benchmarks upon floating-point 
arithmetic on Kahan's web page. 
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presumably in order to increase performance in benchmarks. Some algorithms 
are written in a certain way so as to minimise roundoff error, and compilers 
should not rewrite them in another way. Also, static analysis results obtained 
from the source code (see Section 173]) may not be applicable on the object code 
if the compilers make such unsafe transformations. We thus suggest that users 
of analysis tool operating on the source code read the documentation of their 
compiler carefully in search of discussion of "relaxed" or "strict" floating-point 
issues. 



4.4 Input/output issues 

Another possible compilation issue is how compilers interpret constants in source 
code. The C norm states: 

For decimal floating constants, and also for hexadecimal floating 
constants when FL TJIAD I jf^l is not a power of 2, the result is either 
the nearest representable value, or the larger or smaller representable 
value immediately adjacent to the nearest representable value, chosen in 
an implementation- defined manner. For hexadecimal floating constants 
when FLTJIADIX is a power of 2, the result is correctly rounded. 

This means that two compilers on the same platform may well interpret the 
same floating-point decimal literal in the source code as different floating-point 
value, even if both compilers follow C99 closely. Similar limitations apply to 
the behaviour of the C l ibrary whe n converting from decimal representations to 
floating-point variables (iSOl . Il999l §7.20.1.3]. 

Reading and printing floating-point numbers accurately is a non-trivial issue 
if the printing base (here, 10) is not a power of the computation base (binary 
in the case of IEEE-754) (Clinged . Il990l ISteele and White! . Il990l ]. There exist 



ISO 


, 1999 


1989 


, IEE 



1985L §5.6], such that printing and reading back the values should 
yield the same numbers, within certain bounds. However, we have seen that 
the standard C libraries of certain systems are somewhat unreliable; thus, one 
may prefer not to trust them on accuracy issues. Printing out exact values and 
reading them is important for replaying exact test cases. 

In order to alleviate this, we suggest the use of hexadecimal floating-point 
constants, which arc interpreted exactly. Unfortunately, many older compilers 
do not support these; also, in order to print floating-point values as hexadecimal 
easily, printf and associated functions have to support the 7 a and 7«A formats, 
which is not yet the case of all current C libraries. 



5 Example 

Arguably, many of the examples we gave in the preceding sections, though cor- 
rect, are somewhat contrived: they discuss small discrepancies, often happening 
with very large or very small inputs. In this section, we give a complete and 

19 FLT_RADIX is the radix of floating-point computations, thus 2 on IEEE-754 systems. There 
currently exist few systems with other radices. 
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realistic example of semantic problems related to differences between floating- 
point implementations (even, dependent on compilation options!). It consists of 
two parts: 

1. an algorithm for computing a modulo (such as mapping an angle into 
[—180, 180] degrees), inspired by an algorithm found in an embedded sys- 
tem; 

2. possible implementations of tabulated functions. 

The composition of the two give seemingly innocuous implementations of angu- 
lar periodic functions... which crash for certain specific values of the inputs on 
certain platforms. 

Given x, m and M (m < M), we want to compute r such that r — x is an 
integer multiple of M — to and to < r < M. The following algorithm, adapted 
from code found in a real-life critical system, is correct if implemented over the 
real numbers: 

double modulo (double x, double mini, double maxi) { 
double delta = maxi-mini; 
double deel = x-mini ; 
double q = decl/delta; 
return x - f loor(q)*delta; 

} 

Let us apply this algorithm to the following case: 

int mainO { 

double m = 180 . ; 

double r = modulo (nextaf ter (m, 0.), -m, m) ; 

} 

We recall that floor(cc) is the greatest integer less than or equal to x, and 
that nextafter(a, b ) is the next representable double value from a in the 
direction of b . nextafter(m, 0.) is thus the greatest double-precision number 
strictly smaller than 180. 

The above program, compiled by gec 3.3 with optimisation for the x87 tar- 
get, yields an output r ~ 179.99999999999997158. In such a mode, variable q is 
cached in a extended precision register; q = 1— £ with e ~ 7.893. 10~ 17 . However, 
if the program is compiled without optimisation, deel is saved to, then reloaded 
from, a double-precision temporary variable; in the process, deel is rounded to 
360, then q is 1. The program then returns r ~ -180.00000000000002842, which 
is outside the specified bounds. 

Simply rewriting the code as follows makes the problem disappear (because 
the compiler holds the value in a register): 

double modulo (double x, double mini, double maxi) { 
double delta = maxi-mini; 
return x - floor ((x-mini) /delta) *delta; 

} 

This is especially troubling since this code and the previous one look semanti- 
cally equivalent. 
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The same phenomenon occurs, even if optimisation is turned on, if we add a 
logging statement (printf ()) after the computation of decl. This is due to the 
forced spilling of the floating-point registers into main memory across function 
calls. 

Interestingly enough, we discovered the above bug after Astree would not 
validate the first code fragment with the post-condition that the output should 
be between mini and maxi: Astree was giving an interval with a lower bound 
slightly below mini. After vainly trying to prove that the code fragment worked, 
the author began to specifically search for counter-examples. In comparison, 
simply performing unit testing on a IA32 PC will not discover the problem if the 
compiler implements even simple register scheduling (which will be turned on by 
any optimisation option). Guidelines for testing generally specify that programs 
should be tested on the target system; however, unit testing on the target 
architecture will discover this problem only if using carefully chosen values. The 
above phenomenon was missed because testers, following testing guidelines, had 
tested the program at points of discontinuity, and slightly before and after these 
points, but had not tested the values just before and after these points. 

Now, one could argue that the odds of landing exactly on nextafter(180. , 
. ) are very rare. Assuming an embedded control routine executed at 100 kHz, 
24 hours a day, and a uniform distribution of values in the [—180, 180] interval, 
such a value should happen once every 4000 years or so on a single unit[^3 How- 
ever, in the case of a mass-produced system (an automobile model, for instance), 
this argument does not hold. If hundreds of thousands of systems featuring a 
defective component are manufactured every year, there will be real failures 
happening when the systems are deployed — and they will be very difficult to 
recreate and diagnose. If the system is implemented in single precision, the 
odds are considerably higher with the same probabilistic assumptions: a single 
100 Hz system would break down twice a day. 

Now, it seems that a slight error such as this should be of no consequence: 
the return value is less than the intended lower bound, but still extremely close 
to it. However, let us consider a typical application of computing modulos: 
computing some kind of periodic function, for instance depending on an angle. 
Such a function is likely to contain trigonometric operators, or other operations 
that are costly and complex to compute. A common way to work around this 
cost and complexity is to look such functions up from a precomputed table. 

One implementation technique sometimes found is to put all these tables in 
some big array of constants (perhaps loaded from a file during initialisation), 
and read from the array at various offsets: 

val = bigTable [r + 180 + FUNCTI0N_F_0FFSET] ; 

This means an implicit truncation of r+180+FUNCTI0N_F_0FFSET to 0; if r is a 
little below -180, then this truncation will evaluate to FUNCTI0N_F .OFFSET - 1. 
The table look-up can then yield whatever value is at this point in memory, 
possibly totally out of the expected range. 

Another example is a table look-up with interpolation: 

double periodicFunction(double r) { 

20 For 180, the unit at the last position is <5 = 2" 45 ; all reals in (180 - (3/2)5, 180 - 8/2) 
are rounded to 180 — <5, thus the probability of rounding a random real in [—180, 180] to this 
number is 360/(5. 
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double biased = r+180; 
double low = floor (biased) ; 
double delta = biased-low; 
int index = (int) low; 
return ( table [index] *(l-delta) 
+ table [index+1] *delta ); 

} 

If r is slightly below —180, the value return will depend on table [-1] , that is, 
whatever is in memory before table. This can result in a segmentation fault 
(access to a unallocated memory location), or, more annoyingly, in reading 
whatever is at that location, including special values such as NaN or ±00, or 
even simply very large values. With table [-1] = 10 308 and table [0] = 0, the 
above program outputs approximately 2.8 x 10 294 — a value probably too large 
for many algorithms to handle gracefully. 

Such kinds of problems are extremely difficult to reproduce if they are found 
in practise — they may result in program crashes or nondeterministic behaviour 
for very rare input values. Furthermore, they are not likely to be elicited by 
random testing^ Finally, the problem may disappear if the program is tested 
with another compiler, compilation options, or execution platform. 



6 A few remarks on Java 



Java' s early floating-point model was a strict subset of IEEE-754 [Gosling et al 



19961 §4.2.3]: essentially, strict IEEE-754 single and double-precision arith- 
metic without the exception traps (overflow, invalid operation. . . ) and without 
rounding modes other than round-to-nearest. However, strict compatibility with 
IEEE-754 single- and double-precision operations is difficult to achieve on x87 
(see Section l3.1.1|) . As a consequence, requests were made so that strict compat- 
ibility would be relaxed in order to get better performance, particularly for sci- 
entific computing applications. The possib ility of giving up Java's d eterministic, 
portable semantic s was requested by some IKahan and DarcvL[l998l ]. but contro- 



versial for others [Java Grande Forum Panel. Il998j . Finally, the Java language 



specification was altered [Gosling et all 12000. §4.2.3]: run-time computing in 
extra precision (single-extended and double-extended formats) was allowed for 



classes and methods not carrying the new strictfp modifier [Gosling et al 



I2000LI2005L §15.4]: they may be evaluated with an extended exponent range. To 
summarise, the Java designers made it legal to compile expressions to straight- 
forward x87 code. 

This is actually a bolder decision than may appear at first sight. The 
whole design of the Java language, with well-defined semantics, aimed at en- 
forcing unique semantics for single-threaded program^ that do not use system- 
dependent features (files, graphical interfaces, etc.): a program that merely did 



21 The example described here actually demonstrates the importance of non-random testing: 
that is, trying values that "look like they might cause problems"; that is, typically, values at 
or close to discontinuities in the mathematical function implemented, special values, etc. 

22 Java does not attempt to prescribe how a multi-threaded program should be executed: it 
descr ibes valid multi-threading behaviours, all of which are equally acceptable iGosling et al.L 
120051 . Chapter 17]. Some nondeterminism is thus acceptable. 
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single-threaded computations should behave identically regardless of the ma- 
chine, operating system and Java runtime system. This decision breaks this 
uniqueness of semantics and introduces platform dependencies, which Java was 
supposed to overcome. 

Let us take for instance the foo program of Section f3 . 1 . H translated into 
Java: 

class Foo { 

static double foo (double v) { 
double y = v*v; 
return (y / v) ; 

} 



public static void main (String [] args) { 
System. out. println(f oo(lE308) ) ; 

} 

} 

We use gcj (the Java compiler associated with gec) version 4.1.1 on a x86_64 
system. Unsurprisingly, when compiled to machine code for the SSE target, the 
program prints Infinity. The results get more diverse on the x87 target: 



If no optimisation is used, the program prints Infinity. Examination of 
the assembly code shows that intermediate values are spilled to memory. 

If optimisation level 1 is used (-0), then the program prints 1E308 (10 308 ). 
Intermediate values are left in registers, but functions are not inlined. 

If optimisation level 3 is used ( _ 03), the program prints Infinity. 
The compiler inlines function foo and detects the parameter to 
System. out. printlnO is a constant. It computes that constant using 
strict IEEE-754 double precision, thus the result. 

The strictfp modifier should force the program to adopt strict IEEE -754 
round -to- nearest semantics. However, this modifier is ignored by gcj [Frel 
2005al . §2.1] (and the above experiments show identical results regardless of 



strictfp). The author is even unsure that the behaviour noted above is cor- 
rect even in the absence of str ictfp: strictfp is supposed to affect only 



temporaries [Gosling et all I2005L §15.4], and y is not a temporary. 

The "normal" mode of operation of Java is not straight compilation to na- 
tive code, but compilation to bytecode; the bytecode may then be executed 
inside a Java Runtime Environment (JRE), comprising a Java Virtual Machine 
(JVM). The simplest JVMs just interpret the bytecode, but more advanced 
ones do "just-in-time compilation" (JIT). A JIT-capable virtual machine will 
typically interpret bytecode at first, but will detect frequently used functions 
and will compile these to native codes. Because JIT occurs at runtime, it can 
perform advanced optimisations, such as detecting that certain arguments occur 
frequently for certain functions and specialising these functions, that is, compil- 
ing a special version of these functions for certain values of its arguments and 
replacing the calls to the generic function by calls to the specialised function 
when appropriate. The same kind of problems that we gave above for generation 
of native code can thus occur: the interpreted version will spill temporaries to 
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memory and output one value, a JIT-compiled version may give another value, 
but if JIT specialises the function it may give a third value (or the same value as 
if interpreted). Note that JIT can thus dynamically change the semantics of a 
function for reasons unrelated to the program being executed: the programmer 
has no means to predict or control when and how JIT compilation is performed. 

We can conclude that programmers should be cautious before assuming that 
Java programs will behave predictably when using floating-point. First, this is 
only true if the strictf p modifier is used, and, even then, some very popular 
compilers and runtime systems (gcj ships by default with many GNU/Linux 
systems) ignore this modifier and may even ignore some other parts of the 
specification. The presence of JIT compilers may also add various amusing 
effects. 



7 Implications for program verification 

Program verification comprises a variety of methods whose goals is to prove 
formally that programs fit their specifications, often lumped together under 
the term "formal methods" . Formal methods have long ignored floating-point 
computations, because they were judged too baroque or too difficult to model. 

7.1 Goals of program verifications 

Purposes of verification of programs using floating-point computations may be, 
in increasing order of complexity: 

1. Proving that the program will never trigger "undefined" or "undesirable" 
behaviours, such as an overflow on a conversion from a floating-point type 
to an integer type. This problem has attracted the attention of both the 
industrial world and the program analysis community since the much pub- 
licised self-destruction of the Ariane 5 launcher during its maiden flight, 
which was due to overfl ow in a conversion from a 64-bit floating point 
value to a 16-bit integer [Lions et al 1 11996} . 



Proving that a value does not overflow entails finding some bounds for 
that value, and if that value is the result of a complex computation de- 
pending on the history of previous inputs and outputs (as is the case of e.g. 
infinite impulse response filters, rate limiters, and combinations thereof), 
then finding and proving such bounds entails proving the stability of the 
numerical computation. In many cases, though, automatic program anal- 
ysers, taking as input the source code of the program (typically in C or 
some other language, perhaps even in assembly language) can automati- 
cally compute properties of such systems. If the analyser is sound, then 
all the properties it prints (say, x < 5) hold for every run of the analysed 
programs. 

The Astree system analyses programs written in a subset of the C pro- 



grams 1 

gramming language [Cousot et all 20051 iBlanchet et all 120031 . |2002| and 



attempts bounding all variables and proving the absence of overflows and 
other runtime errors. While it is possible to specify assertions (such as 
bounds on some variables representing physical quantities with known lim- 
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its), which Astree will attempt to prove, the system is not designed to 
prove such user-defined properties. 



Pin-pointing the sources of roundoff errors in the program; proving an 
upper bound on the amount of roundoff error in some variable. 

While in the preceding class of problems, it does not matter whether the 
numerical computation makes sense as long as it does not crash or vi- 
olate some user-specified assertion, here the problems are subtler. The 
Fluctuat tool automatically provides results on the origin and magni- 
tude of roundoff err ors in numerical programs [Martell . I200H iGoubanld. 
20011 lMarteil2002bl lat. 



Proving that the program implements such or such numerical computation 
up to some specified error bound. Except in the simplest cases, automated 
methods are unsuitable; if formal proofs are desired, then computerised 
proof assistants may help. Proving that a program fits its specification 
is difficult in general, even more so over numerical programs, which do 
not have "good" algebr aic properties; yet some pr ogress has been recently 



made in that direction [Filliatre and Boldd . 12007 1. 



7.2 Semantic bases of program analysis 

In order to provide some proofs in the mathematical sense, one has to s tart with 



a ma thematical definition of program behaviours, that is, a semantics. [Winskel , 
Il993l | This semantics should model all possible concrete behaviours of the sys- 
tem, without omitting any. 

An alternative point of view is that, for the sake of simplicity of verifica- 
tion, the chosen semantics may fail to reflect some behaviours of the concrete 
system. For instance, one may consider that all floating-point variables behave 
as real numbers. We have shown, however, in Sec. 03 that very real bugs can 
occur in simple, straightforward programs that are correct if implemented over 
the real numbers, but that exhibit odd, even fatal, behaviours due to floating- 
point roundoff. Such an approach is thus risky if the goal is to provide some 
assurance that the program performs correctly, but it may be suitable for bug- 
finding systems, whose goal is not to prove the absence of bugs, but to direct 
programmers to probable bugs. Indeed, bug-findings techniques are typically 
unsound: they trade soundness (all possible bugs should be listed) for ease of 
use (not too many warnings about bugs that do not exist) and efficiency (quick 
analysis). In that context, methods that consider that floating-point variables 
contain real numbers (as opposed to floating-point values), or that integer vari- 
ables contain unbounded integers (as opposed to n-bit integers computed using 
modular arithmetic), may be relevant. 

In this paper, we are concerned with sound verification techniques: tech- 
niques that only produce correct results; that is, if the analysis of a program 
results in the analyser claiming that some conversion from floating-point to in- 
teger will not overflow, then it should not be possible to have this conversion 
overflow, regardless of inputs and roundoff errors. Given the various misunder- 
standings about floating-point that we cited in the previous sections, it is no 
surprise that it is extremely easy for an analysis designer to build an unsound 
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static analysis tool without meaning it, for instance by starting with a semantics 
of floating-point operations that does not model reality accurately enough. 

It is well-known that any method for program verification cannot be at the 
same time sound (all results produced are truthful) , automatic (no human inter- 
vention) , complete ( true results ca n always be proved) and terminating (always 
produces a result) [Cousotl . [l990]FI unless one supposes that program mem- 



ory is finite and thus that the system is available to model- checking techniques. 
Given that the state spaces of programs with floating-point variables are enor- 
mous even with small numbers of variables, and that the Boolean functions 
implementing floating-point computations are not very regular, it seems that 
model-checking for whole floating-point algorithms should not be tractable. As 
a consequence, we must content ourselves with techniques that are unsound, 
non-terminating, incomplete, or not automatic, or several at the same time. 
The purpose of this section is to point out how to avoid introducing unsound- 
ness through carelessness. 

All terminating automatic program analysis methods are bound to be incom- 
plete; that is, they may be unable to prove certain true facts. Incompleteness 
is equivalent to considering a system that has more behaviours than the true 
concrete system. Since we must be incomplete anyway, it is as well that we 
take this opportunity to simplify the system to make analysis more tractable; 
in doing so, we can still be sound as long as we only add behaviours, and not 
remove any. For instance, in Astree, most analyses do not attempt to track 
exactly (bit-by-bit) the possible relationships between floating-point values, but 
rather rely on the bound on roundoff given by inequality OH 

7.3 Difficulties in defining sound semantics 

We have seen many problems regarding the definition of sound semantics for 
programs using floating-point; that is, how to attach to each program a math- 
ematical characterisation of what it actually does. The following approaches 
may be used: 

• A naive approach to the concrete semantics of programs running on 
"IEEE- 754-compatible" platforms is to consider that a +, -, * or / sign in 
the source code, between operands of type float (resp. double), corre- 
sponds to a strict IEEE-754 ©, Q, <g), operation, with single-precision 
(resp. double-precision) operands and result: a © b = r(a + b) where r 
rounds to the target precision. As we have seen, this does not hold in 
many common cases, especially on the x87 (Section 13. 1.1[) . 

• A second approach is to analyse assembly or object code, and take the 
exact processor-specified semantics as the concrete semantics for each 
operation. This is likely to be the best solution if the compiler is not 
trusted not to make "unsafe optimisations" (see 8 34.3.21) . It is possible 
to pe rform static analysis directly o n the assembly code, or even object 
code Balakrishnan and Repsl 2004 1 . In addition, it is possible, if source 



23 The formal version of this result is a classic of recursion theory, known as Rice's theorem: 
let C be a collection of partial recursive functions of one variable, then, noting (fr x the partial 
recursive function numbered x, then {x \ <f> x £ C} has a recursive characteristi c function if 
and o nly if C is e mpty or contains all partial recursive functions of all variables. |Rogersl [19871 . 
p. 34] Rice, 1953, corollary B] This result is proved by reduction to the halting problem. 
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code in a high level language is available, to help the assembly-level static 
analyser with infor mation obtain ed from source- level analysis, through in- 



variant translation Rivall . 120031 ] . which may be helpful for instance for 
pointer information. 

One remaining difficulty is that some "advanced" functions in floating- 
point units may have been defined differently in successive generations 
of processors, so we still cannot rule out discrepancies. However, when 
doing analysis for embedded systems, the exact kind of target processor 
is generally known, so it is possible to use information as to its behaviour. 
Another possibility is to request that programmers do not use poorly- 
specified functions of the processor. 

• A third approach is to encompass all possible semantics of the source code 
into the analysis. Static analysis methods based on abstract interpreta- 
tions (Section 17. 5|) are well-suited for absorbing such "implementation- 
defined" behaviours while still staying sound. 

7.4 Hoare logic 

As an example in the difficulty of proposing simple semantics based on the 
sourc e code of the program, l et us c onsider the proof rules of the popular Hoare 



sourc e code o t the progr am, l et us c onsider tne proot rules ot the popular rioare 
logic [HoareL Il96f{ IWinskell . Il993| . These rules are the basis of many proof 



assistants and other systems for proving properties on imperative programs. A 
Hoare triple {A}c{£?} is read as follows: if the program state at the beginning 
of the execution of program c verifies property A, then, if c terminates, then the 
final program state verifies property B. A rule 

Hi . . . H n 



C 

reads as: if we can prove the hypotheses Hi, . . . ,H n , then we can prove the 
conclusion C by applying rule r. If there are zero hypotheses, then we say rule 
r is an axiom. 

We recall the classical rules for assignment, sequence of operation, and test: 

assign 



{B[x i — ► a]}x := a{B} 
{A}c {B} {B}ci{C} 



sequence 



{A}c ;ci{C} 
{AAb}c {B} {Ah^b}ci{B} 
{A}if b then c else ci{B} 



if 



In the following rule, we use hypotheses of the form h C, reading "it is 
possible to prove C in the underlying mathematical logic" . Indeed, Hoare logic, 
used to reason about programs, is parametrised by an underlying mathematical 
logic, used to reason about the quantities inside the program. 

h A A' {A'}P{B'} h B' B 
{A}PjB} W6akening 

These rules sound suitable for proving properties of floating-point programs, 
if one keeps floating-point expressions such as x © y inside formulae (if multiple 
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floating-point types are used, then one has to distinguish the various precisions 
for each operator, e.g. x (Bd V defined as r<i(x + y)). 

Hoare logic is generally used inside a proof assistant, a program which al- 
lows the user to prove properties following the rules, possibly with some partial 
automation. The proof assistant must know how to reason about floating-point 
quantities, x © y may be defined as r(x + y), where r is the appropriate round- 
ing function, x + y is the usual operation over the reals (thus, the underlying 
mathematical logic of the Hoare prover should include a theory of the reals), 
and r is the rounding function, which may be defined axiomatically. Inequalities 
such as ineq. [2j or properties such as r o r = r then appear as lemmas of the 
mathematical prover. An axiomatisa tion of the floating-point numbers is, for 
instance, used in the CADUCEUS tool [Filliatre and Boldoll2007| . 



However, the above rules, as well as all tools based on them, including 
Caduceus, are unsound when applied to programs running on architectures 
such as the x87. They assume that any given arithmetic or Boolean expres- 
sion has a unique value, depending only the value of all variables involved, 
and that this value does not change unless at least one of the variables in- 
volved changes through an assignment (direct or through pointers). Yet, in the 
program zero_nonzero . c of Sec. 13.1.11 we have shown that it is possible that 
inequality z ! = holds at a program line, then, without any instruction assign- 
ing anything to z in whatever way, that inequality ceases to hold. Any proof 
assistant straightforwardly based on Hoare logic would have accepted proving 
that the assertion in zero_nonzero . c always holds, whereas it does not in some 
concrete executions. 

Such tools based on "straightforward" Hoare rules are thus sound only on 
architectures and compilation schemes for which the points where rounding 
takes place are precisely known. This excludes: 

• The x87 architecture, because of rounding points depending on register 
scheduling and other circumstances not apparent in the source code. 

• Architectures using a fused multiply-add instruction, such as the PowerPC 
(Sec. I3.2| ). because one does not know whether x <£> y © z will get executed 
as r(r(x x y) + z) or r(xy + z). 

What can we propose to make these tools sound even on such architectures? 
The first possibility is to have tools operate not on the C source code (or, more 
generally, any high-level source code) but on the generated assembly code, whose 
semantics is unambiguously However, directly checking assembly code is stren- 
uous, since code in assembly language is longer than in higher-level languages, 
directly deals with many issues hidden in higher-level languages (such as stack 
allocation of variables), and exhibits many system dependencies. 

Another solution for the x87 rounding issue, is to replace the "simple" ax- 
iomatic semantics given above by a nondeterministic semantics where the non- 
determinacy of the position of rounding is made explicit. Instead of an as- 
signment x := a being treated as a single step, it is broken into a sequence of 
elementary operations. Thus, x := a ® (b ® c), over the double precision num- 
bers, is decomposed into t :— b ® c; x :— a © t. Then, each operation is written 
using the r e extended precision rounding function and the double precision 



24 At least for basic operations ©, ©, ®, 0, ^J. We have seen in Section l4,ll that transcen- 
dental functions may be less well specified. 
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rounding function, taking into account the nondeterminism involved, noting ndt 
a function returning true or false nondeterministically and skip the instruction 
doing nothing: 

t := r e (b x c); 

if ndt then t := rd{t) else skip; 
X := r e (a + t) 

The transformation consists in using the r e rounding operation in elemen- 
tary operations (thus a © b gets translated as r e {a + b)) and prepending an 
optional (nondeterministically chosen) r d rounding step before any operation 
on any operand on that operation. The resulting code is then suitable for 
proofs in Hoare logic, and the properties proved are necessarily fulfilled by any 
compilation of the source code, because, through nondeterminism, we have en- 
compassed all possible ways of compiling the operations. We have thus made 
the proof method sound again, at the expense of introduced nondeterminism 
(which, in practise, will entail increased complexity and tediousness when prov- 
ing properties) and also, possibly, of increased incompleteness with respect to a 
particular target (we consider behaviours that cannot occur on that target, due 
to the way the compiler allocates registers or schedules instructions). 

If we know more about the compiler or the application binary interface, we 
can provide more precise semantics (less nondeterminism, encompassing fewer 
cases that cannot occur in practice). For instance, using the standard parameter 
passing scheme on IA32 Linux for a non-inline function, floating-point values 
are passed on the stack. Thus, a Hoare logic semantics for parameter passing 
would include a necessary r s (single precision) or r d rounding phase, which could 
simplify the reasoning down the program, for instance by using the fact that 
r s or s = r s and r d o r d — r d . 

A similar technique may be used to properly handle compound instructions 
such as fused-multiply-add. On a processor where fused- multiply- add yields 
r(a x b + c), there are two ways of compiling the expression (a £g> b) © c: as 
r(r(a x b) + c) or as r(a x b + c). By compiling floating-point expressions (using 
ffi, (g> etc.) into a nondeterministic choice between unambiguous expressions over 
the reals (using r, +, x etc.) we get a program amenable to sound Hoare proof 
techniques. Again, the added nondeterminism is likely to complicate proofs. 

This technique, however, relies on the knowledge of how the compiler may 
possibly group expressions. Because compilers can optimise code across instruc- 
tions and even across function calls, it is likely that the set of possible ways of 
compiling a certain code on a certain platform is large. 

To summarise our findings: the use of Hoare- logic provers is hampered by 
platforms or language where a given floating-point expression does not have a 
single, unambiguous meaning; straightforward application of the rules may yield 
unsound results, and workarounds are possible, but costly. 



7.5 Static analysers based on abstract interpretation 

We here consider methods based on abstract interpretation, a generic frame- 
work for giving an over-approxi mation of the set of possible program execu- 
tions. [Cousot and Cousoti . [1992] An abstract domain is a set of possible sym- 
bolic constraints with which to analyse programs. For instance, the interval 
domain represents constraints of the form G\ < x < C2, the octagon abstract 
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domain Ming . |2001| constraints of the form ±x ± y < C where x and y are 



program variables and the C are numbers. 
7.5.1 Intervals 

The most basi c domain for th e analysis of floating-p oint programs is the in- 



ihe most basic domai n tor th e analysis ot tloatmg-pomt programs is the in- 
terval domain jBlanchet et all I2002L iMine . 2004al |bl]: to each quantity x in 



the program, attach an interval [m x , M x ] such that in any concrete execution, 
x € [m x ,M x ]. Such intervals [m x ,M x ] can be computed efficiently in a static 
analyser for software running on a IEEE- 754 machine if one runs the analyser 
on a IEEE-754 machine, by doing interval arithmetic on the same data types as 
used in the concrete system, or smaller ones. 

Fixing x, y 1 — ^ x © y is monotonic. It is therefore tempting, when imple- 
menting interval arithmetic, to approximate [a, b] © [a', b'\ by [a© a', 6© b'] with 
the same rounding mode. Unfortunately, this is not advisable unless one is 
really sure that computations in the program to be analysed are really done 
with the intended precision (no extended precision temporary values, as com- 
mon on x87, see Sec. l3.1.Tj) . do not suffer from double rounding effects (see 13.1.21 
and the preceding section), and do not use compound operations instead of 
atomic elementary arithmetic operations. In contrast, it is safe to use directed 
rounding: computing upper bounds in round-to-+oo mode, and lower bounds 
in round-to — 00 m ode: \a,b] © \a',b'] = [a ffi_oo o,',b (B+oo &']@ This is what 



is done in Astree Cousot et all 120051 ] . Note, however, that on some systems, 
rounding-modes other than round-to-nearest are poorly tested and may fail to 
work properly (see Section [4~2"f : the implementers of analysers should therefore 
pay extra caution in that respect. 

Another area where one should exercise caution is strict comparisons. A sim- 
ple solution for abstracting the outcome for x of x < y, where x G [m x , M x ] and 
y G [m y ,M y ], is to take M' x = min(M x ,M y ), as if one abstracted x <= y. Un- 
fortunately, this is not sufficient to deal with programs that use special floating- 
point values as markers (say, is a special value meaning "end of table"). One 
can thus take M' x = mm(M x ,pred(M y )) where pred(x) is the largest floating- 
point number less than x in the exact floating-point type used. However, as we 
showed in §3.1.1[ on x87 a comparison may be performed on extended precision 
operands, which may later be rounded to lesser precisions depending on register 
spills. We showed in the preceding section that this particularity made some 
proof methods unsound, and it also makes the above bound unsound. 

In order to be sound, we should use the pred operation on the extended preci- 
sion type; but then, because of possible rounding to lesser precisions, we should 
round the bounds of the interval... which would bring us back to abstracting < 
as we abstract <. 

We can, however, refine the technique even more. If we know that x is 
exactly representable in single (resp. double) precision, then we can safely use 
the pred function on single (resp. double) precision floats. Some knowledge of 
the compiler and the application binary interface may help an analyser establish 
that a variable is exactly representable in a floating-point type; for instance, if 
a function is not inlined and receives a floating-point value in a double variable 



25 Changing the rounding mode may entail significant efficiency penalties. A common trick is 
to set the processor in round-to-+oo mode permanently, and replace x(B—ooV by — ((— x)(B+cc 
(—«/)) and so on. 
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passed by value on the stack, then this value is exactly representable in double 
precision. More generally, values read from memory without the possibility that 
the compiler has "cached" them inside a register will necessarily be representable 
in the floating-point format associated with the type of the variable. 



7.5.2 Numerical relational domains 

The interval domain, while simple to implement, suffers from not keeping rela- 
tions between variables. For many applications, one should also use relational 
abstract domains — abstract domains capable of reflecting relations between 
two or more variables. Relational domains however tend to be designed for data 
taken in ideal structures (ordered rings, etc.); it is thus necessary to bridge the 
gap between these ideal abstract structures and the concrete execution. Fur- 
thermore, it is necessary to have effective, implementable algorithms for the 
abstract domains. 

Mine proposed three abstraction steps Minel 2004al ib] : 



From a (deterministic) semantics over floating-point numbers to a non- 
deterministic semantics over the real numbers, taking into account the 
rounding errors. 

From the "concrete" nondeterministic semantics over the reals, to some 
ideal abstract domain over the reals (M.) or the rationals (Q), such as the 



octagon abstract domain [Minel . 12001 1 



3. Optionally, the ideal abstract domain is over-approximated by some effec- 
tive implementation. For instance, wherever a real number is computed 
in the ideal domain, a lower or upper bound, or both, is computed in the 
implementation, for instance using floating-point arithmetic with directed 
rounding. 

Following a general method in abstract interpretation Cousot . Il997j . the 



succession of abstraction steps yields a hierarchy of semantics: 

concrete semantics over the floats 
C 

nondeterministic semantics over the reals 
C 

ideal abstract domain over 1 or Q 
C 

(optional) effective implementation over the floats 

The analysis method is sound (no program behaviour is ignored), but, because 
of the abstractions, is incomplete (the analyser takes into account program 
behaviours that cannot appear in reality, thus is incapable of proving certain 
tight properties). 

The first step is what this paper is concerned about: a sound modelling of 
the floating-point computations. If this step is not done is a sound manner, for 
instance if the semantics of the programming language or the target platform 
with respect to floating point are misunderstood, then the whole analysis may 
be unsound. 

The simplest solution, as proposed by Mine and implemented in Astree, 
is to consider that the "real" execution and the floating-point execution differ 
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by a small error, which can be bound as recalled in 12.21 (\x — r(x)\ < e IC \.\x\ + 
e a bs)- However, in doing so, one must be careful not to be too optimistic: 
for instance, because of possible double rounding, the error bounds suitable 
for simply rounded round-to-nearest computations may be incorrect if, due to 
possible "spills" , double rounding may occur (see I3.1.2j) . The e rc i coefficient 
must thus be adjusted to compensate possible double rounding. 

The second step depe nds on the abstract domain. For instance, for the 
octagon abstract domain [Minel l200l| . it consists in a modified shortest-path 
algorithm. In any case, it should verify the soundness property: if a constraint 
(such as x + y < 3) is computed, then this constraint should be true on any 
concrete state. Finally, the last step is usually achieved using directed rounding. 



7.5.3 Symbolic abstract domains 

Other kinds of relational domains propagate pure constraints or guards (say, 
x = y © z). An exam ple is Mine's symbolic rewrite domain implemented in 
Astree Mine . 2004aj | . Intuitively, if some arithmetic condition is true on some 



variables x and y, and neither x nor y are suppressed, then the condition con- 
tinues to hold later. However, we have seen in Section 13.11 that one should 
beware of "hidden" rounding operations, which may render certain propagated 
identities invalid. We may work around this issue using the same techniques as 
in Sec. 17.41 allow rounding at intermediate program points. 

In addition, we have also seen that certain equivalences such as x y = 
x — y are not necessarily valid, depending on whether certain modes 

such as flush-to-zero are active! 26 ! 



7.6 Testing 

In the case of embedded system development, the development platform (typi- 
cally, a PC running some variant of IA32 or x86_64 processors) is not the same as 
the target platform (typically, a microcontroller). Often, the target platform is 
much slower, thus making extensive unit testing time-consuming; or there may 
be a limited number of them for the development group. As a consequence, it 
is tempting to test or debug numerical programs on the development platform. 
We have shown that this can be a risky approach, even if both the development 
platform and the target platform are "IEEE-754 compatible" , because the same 
program can behave differently on two IEEE-754 compatible platforms. 

We have also seen that in some cases, even if the testing and the target 
platform are identical, the final result may depend on the vagaries of compila- 
tion. One should be particularly cautious on platforms, such as IA32 processors 
with the x87 floating-point unit, where the results of computations can depend 
on register scheduling. Even inserting "monitoring" instructions can affect the 
final result, because these can change register allocation, which can change the 
results of computations on the x87. This is especially an issue since it is common 
to have "debugging-only" statements excluded from the final version loaded in 
the system. In the case of the x87, it is possible to limit the discrepancies (but 
not totally eradicate them) by setting the floating-point unit to 53-bit mantissa 
precision, if one computes solely with IEEE-754 double precision numbers. 

26 There is still the possibility of considering that x y = => \x — y\ < 2 Bmin . 



3G 



Static analysis techniques where concrete traces are replayed inside the anal- 
ysis face similar problems. One has to have an exact semantic model of the 
program to be analysed as it runs on the target platform. 

8 Conclusion 

Despite claims of conformance to standards, common software/hardware plat- 
forms may exhibit subtle differences with respect to floating-point computations. 
These differences pose special problems for unit testing or debugging, unless one 
uses exactly the same object code as the one executed in the target environment. 

More subtly, on some platforms, the exact same expression, with the same 
values in the same variables, and the same compiler, can be evaluated to differ- 
ent results, depending on seemingly irrelevant statements (printing debugging 
information or other constructs that do not openly change the values of vari- 
ables). This is in particular the case on the Intel 32-bit platform, due to the 
way that temporary variables are commonly handled by compilers. This breaks 
an assumption, common in program verification techniques, that, barring the 
modification of some of the variables used in the expression (directly or through 
a pointer), the same expression or test evaluates twice to the same value. We 
have seen that this assumption is false, even providing concrete example where a 
condition holds, then ceases to hold two program lines later, with no instruction 
in the source code that should change this condition in between. 

We proposed some alterations to program verification techniques based on 
Hoare logic in order to cope with systems where this assumption does not hold. 
However, these alterations, necessary in order to preclude proving false state- 
ments on programs, make proofs more complex due to increase nondeterminism. 
With respect to automated analysis techniques based on abstract interpretation, 
we explained how to cope with the Intel 32-bit platform in a sound way, includ- 
ing with precise analysis of strict comparisons, through simple techniques. 

With respect to the development of future systems, both hardware, and 
software, including compilers, we wish to stress that reproducibility of floating- 
point computations is of paramount importance for safety-critical applications. 
All designs — such as the common compilation schemes on the Intel 32-bit 
platform — where the results can change depending on seemingly irrelevant 
circumstances such as the insertion of a "print" statement make debugging fringe 
conditions in floating-point programs difficult, because the insertion of logging 
instructions may perturb results. They also make several program analysis 
techniques unsound, leading to sometimes expensive fixes. 

The fact that some well-known compilers, by default, decide to activate 
optimisations that contradict published programming language standards and 
also break verification techniques, is not comforting. "Unsafe" optimisations 
often yield improved performance, but their choice should be conscious. 
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